Introduction

Welcome to my R programming sample!

In this program, I look at the locations of voting poll sites in New York City relative to population density and determine whether population is a strong predictor of the number of voting poll sites in a census tract I then also incorporate a few demographic variables into the model to determine if there are any other predictors of the number of voting poll sites in a census tract.

Using open source NYC poll site data as well as census data, I will quality check, modify, merge, visualize, and model this data to answer my questions of 1) whether population is a statistically significant predictor of the number of voting poll sites in a census tract and 2) whether other demographic variables, such as race and income, are statistically significant predictors of the number of voting poll sites in a census tract when accounting for population.

Libraries

library(tidyverse)      # for general programming
library(tidycensus)     # for importing census data
library(tidygeocoder)   # for geocoding addresses
library(sf)             # for manipulating spatial data
library(tmap)           # for creating maps

Import data

# Voting poll site data
pollsites <- read_csv("Voting_Poll_Sites.csv")
pollsites %>% glimpse()
## Rows: 1,231
## Columns: 20
## $ BOROUGH            <chr> "BROOKLYN", NA, "QUEENS", "BROOKLYN", "BROOKLYN", "…
## $ SITE_STATUS        <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
## $ SITE_NAME          <chr> "PS 12", "Wyatt T. Walker Senior Housing", "Allen A…
## $ SITE_NUMBER        <chr> "11537", "11517", "10590", "11493", "11629", "11620…
## $ STREET_NUMBER      <chr> "430", "2177", "112-04", "300", "495", "549", "65-1…
## $ STREET_SUFFIX      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STREET_NAME        <chr> "Howard Avenue", "Frederick Douglass Boulevard", "1…
## $ POSTCODE           <chr> "11233", "10026", "11433", "11223", "11216", "10040…
## $ CITY               <chr> "Brooklyn", "New York", "Jamaica", "Brooklyn", "Bro…
## $ VOTER_ENTRANCE     <chr> "430 Howard Avenue (not used for voters)", "2177 Fr…
## $ HANDICAP_ENTRANCE  <chr> "Enter on Prospect Place through school yard", "217…
## $ Latitude           <dbl> NA, NA, 40.69148, NA, NA, NA, 40.73694, NA, 40.6713…
## $ Longitude          <dbl> NA, NA, -73.78333, NA, NA, NA, -73.81325, NA, -73.9…
## $ `Community Board`  <dbl> 16, NA, 12, 15, 3, 12, 8, 9, 8, 11, 13, 3, 7, 2, 6,…
## $ `Council District` <dbl> 41, NA, 27, 47, 36, 10, 24, 29, 35, 8, 23, 17, 39, …
## $ `Census Tract`     <dbl> 363, NA, 266, 37402, 265, 277, 122702, 216, 339, 17…
## $ BIN                <dbl> 3039174, NA, 4435291, 3195077, 3050974, 1076751, 41…
## $ BBL                <dbl> 3014570032, NA, 4123220001, 3071940001, 3018090062,…
## $ NTA                <chr> "Ocean Hill", NA, "South Jamaica", "Gravesend", "Be…
## $ Location           <chr> NA, NA, "(40.691478, -73.783335)", NA, NA, NA, "(40…
# Census population data (census tract level)
nyc_pop <- tidycensus::get_acs(geography = "tract", variables = "B01003_001", state = "NY", county = c("New York", "Kings", "Queens", "Bronx", "Richmond"), geometry = TRUE)
##   |                                                                              |                                                                      |   0%  |                                                                              |=                                                                     |   1%  |                                                                              |=                                                                     |   2%  |                                                                              |==                                                                    |   3%  |                                                                              |===                                                                   |   4%  |                                                                              |====                                                                  |   5%  |                                                                              |====                                                                  |   6%  |                                                                              |=====                                                                 |   7%  |                                                                              |=======                                                               |  10%  |                                                                              |========                                                              |  12%  |                                                                              |=========                                                             |  13%  |                                                                              |==========                                                            |  14%  |                                                                              |===========                                                           |  16%  |                                                                              |============                                                          |  17%  |                                                                              |============                                                          |  18%  |                                                                              |=============                                                         |  19%  |                                                                              |==============                                                        |  19%  |                                                                              |================                                                      |  23%  |                                                                              |=================                                                     |  25%  |                                                                              |==================                                                    |  26%  |                                                                              |====================                                                  |  29%  |                                                                              |=====================                                                 |  30%  |                                                                              |=====================                                                 |  31%  |                                                                              |=======================                                               |  32%  |                                                                              |========================                                              |  35%  |                                                                              |=========================                                             |  36%  |                                                                              |==========================                                            |  37%  |                                                                              |===========================                                           |  38%  |                                                                              |===========================                                           |  39%  |                                                                              |============================                                          |  40%  |                                                                              |============================                                          |  41%  |                                                                              |=============================                                         |  42%  |                                                                              |==============================                                        |  42%  |                                                                              |==============================                                        |  43%  |                                                                              |===============================                                       |  44%  |                                                                              |===============================                                       |  45%  |                                                                              |================================                                      |  46%  |                                                                              |=================================                                     |  47%  |                                                                              |==================================                                    |  48%  |                                                                              |==================================                                    |  49%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================                                  |  51%  |                                                                              |====================================                                  |  52%  |                                                                              |=====================================                                 |  53%  |                                                                              |======================================                                |  54%  |                                                                              |=======================================                               |  55%  |                                                                              |=======================================                               |  56%  |                                                                              |========================================                              |  57%  |                                                                              |========================================                              |  58%  |                                                                              |=========================================                             |  59%  |                                                                              |==========================================                            |  59%  |                                                                              |==========================================                            |  60%  |                                                                              |===========================================                           |  61%  |                                                                              |===========================================                           |  62%  |                                                                              |============================================                          |  63%  |                                                                              |=============================================                         |  64%  |                                                                              |==============================================                        |  65%  |                                                                              |==============================================                        |  66%  |                                                                              |===============================================                       |  67%  |                                                                              |================================================                      |  68%  |                                                                              |================================================                      |  69%  |                                                                              |=================================================                     |  70%  |                                                                              |==================================================                    |  71%  |                                                                              |==================================================                    |  72%  |                                                                              |===================================================                   |  73%  |                                                                              |====================================================                  |  74%  |                                                                              |====================================================                  |  75%  |                                                                              |=====================================================                 |  76%  |                                                                              |======================================================                |  77%  |                                                                              |=======================================================               |  78%  |                                                                              |=======================================================               |  79%  |                                                                              |========================================================              |  80%  |                                                                              |========================================================              |  81%  |                                                                              |=========================================================             |  81%  |                                                                              |==========================================================            |  82%  |                                                                              |==========================================================            |  83%  |                                                                              |===========================================================           |  84%  |                                                                              |===========================================================           |  85%  |                                                                              |============================================================          |  86%  |                                                                              |=============================================================         |  87%  |                                                                              |==============================================================        |  88%  |                                                                              |==============================================================        |  89%  |                                                                              |===============================================================       |  90%  |                                                                              |================================================================      |  91%  |                                                                              |================================================================      |  92%  |                                                                              |=================================================================     |  92%  |                                                                              |=================================================================     |  93%  |                                                                              |==================================================================    |  94%  |                                                                              |===================================================================   |  95%  |                                                                              |===================================================================   |  96%  |                                                                              |====================================================================  |  97%  |                                                                              |====================================================================  |  98%  |                                                                              |===================================================================== |  98%  |                                                                              |======================================================================|  99%  |                                                                              |======================================================================| 100%
nyc_pop %>% glimpse()
## Rows: 2,327
## Columns: 6
## $ GEOID    <chr> "36005023502", "36005013500", "36005009200", "36005005400", "…
## $ NAME     <chr> "Census Tract 235.02; Bronx County; New York", "Census Tract …
## $ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", "B010…
## $ estimate <dbl> 4284, 3295, 5675, 5306, 4721, 2303, 6444, 1022, 3281, 1901, 7…
## $ moe      <dbl> 896, 699, 696, 834, 765, 368, 873, 223, 551, 414, 1328, 474, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-73.906 40...., MULTIPOLYGON (((…
# Other census demographic variables (census tract level)
nyc_demo <- tidycensus::get_acs(geography = "tract", variables = c(
  "B17026_001", # ratio of income to poverty level
  "B03001_003", # ethnicity: hispanic or latino
  "B03002_003", # race: white alone, not hispanic or latino
  "B03002_004", # race: black or african american alone, not hispanic or latino
  "B03002_005", # race: native american alone, not hispanic or latino
  "B03002_006", # race: asian alone, not hispanic or latino
  "B03002_007", # race: native hawaiian or pacific islander alone, not hispanic or latino
  "B03002_008", # race: other race alone, not hispanic or latino
  "B03002_009" # race: two or more races, not hispanic or latino

), state = "NY", county = c("New York", "Kings", "Queens", "Bronx", "Richmond"))
nyc_demo %>% glimpse()
## Rows: 20,943
## Columns: 5
## $ GEOID    <chr> "36005000100", "36005000100", "36005000100", "36005000100", "…
## $ NAME     <chr> "Census Tract 1; Bronx County; New York", "Census Tract 1; Br…
## $ variable <chr> "B03001_003", "B03002_003", "B03002_004", "B03002_005", "B030…
## $ estimate <dbl> 866, 958, 1545, 10, 79, 0, 17, 63, 0, 3198, 77, 1517, 0, 311,…
## $ moe      <dbl> 287, 765, 409, 18, 59, 13, 22, 231, 13, 556, 98, 484, 19, 356…

Data quality check

Poll site data

This dataset from NYC Open Data contains geographical information about each of the voting poll sites in New York City. In addition to standard data quality checks, goal here is to assess if there is any missingness that would impact the merge to the census data

# Check for full duplicates
pollsites %>% 
  group_by_all() %>%
  filter(n()>1) # 0 rows
## # A tibble: 0 × 20
## # Groups:   BOROUGH, SITE_STATUS, SITE_NAME, SITE_NUMBER, STREET_NUMBER,
## #   STREET_SUFFIX, STREET_NAME, POSTCODE, CITY, VOTER_ENTRANCE,
## #   HANDICAP_ENTRANCE, Latitude, Longitude, Community Board, Council District,
## #   Census Tract, BIN, BBL, NTA, Location [0]
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## #   SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## #   STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## #   HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## #   Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## #   BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>
# Check for partial duplicates on site number
pollsites %>%
  group_by(SITE_NUMBER) %>% 
  filter(n()>1) # 0 rows
## # A tibble: 0 × 20
## # Groups:   SITE_NUMBER [0]
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## #   SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## #   STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## #   HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## #   Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## #   BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>
# Check for missingness across all columns
colSums(is.na(pollsites))
##           BOROUGH       SITE_STATUS         SITE_NAME       SITE_NUMBER 
##                 3                 0                 0                 0 
##     STREET_NUMBER     STREET_SUFFIX       STREET_NAME          POSTCODE 
##                 0              1231                 0                 0 
##              CITY    VOTER_ENTRANCE HANDICAP_ENTRANCE          Latitude 
##                 0                 1                 0               124 
##         Longitude   Community Board  Council District      Census Tract 
##               124                 3                 3                 3 
##               BIN               BBL               NTA          Location 
##                11                11                 3               124

Since the census tract variable is not in the same format as the one in the census population data, I will be merging these two datasets using their spatial geometry. However, 134 rows are missing a latitude and a longitude, so I will geocode them to obtain their coordinates later on.

Census population data

# Check for full duplicates
nyc_pop %>% 
  group_by_all() %>%
  filter(n()>1) # 0 rows
## Simple feature collection with 0 features and 5 fields
## Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS:  NAD83
## # A tibble: 0 × 6
## # Groups:   GEOID, NAME, variable, estimate, moe, geometry [0]
## # ℹ 6 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## #   moe <dbl>, geometry <GEOMETRY [°]>
# Check for partial duplicates on census tract
nyc_pop %>%
  group_by(GEOID) %>% 
  filter(n()>1) # 0 rows
## Simple feature collection with 0 features and 5 fields
## Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS:  NAD83
## # A tibble: 0 × 6
## # Groups:   GEOID [0]
## # ℹ 6 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## #   moe <dbl>, geometry <GEOMETRY [°]>
# Check for missingness across all columns
colSums(is.na(nyc_pop)) # no missing values
##    GEOID     NAME variable estimate      moe geometry 
##        0        0        0        0        0        0

Thankfully, the census population data appears to be already clean.

Other census demographic data

# Check for full duplicates
nyc_demo %>% 
  group_by_all() %>%
  filter(n()>1) # 0 rows
## # A tibble: 0 × 5
## # Groups:   GEOID, NAME, variable, estimate, moe [0]
## # ℹ 5 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## #   moe <dbl>
# Check for partial duplicates on census tract and variable
nyc_demo %>%
  group_by(GEOID, variable) %>% 
  filter(n()>1) # 0 rows
## # A tibble: 0 × 5
## # Groups:   GEOID, variable [0]
## # ℹ 5 variables: GEOID <chr>, NAME <chr>, variable <chr>, estimate <dbl>,
## #   moe <dbl>
# Check for missingness across all columns
colSums(is.na(nyc_demo)) # no missing values
##    GEOID     NAME variable estimate      moe 
##        0        0        0        0        0

Prepare to merge

Geocode

In order to geocode the 124 poll site rows that are missing coordinates, I first have to create a single address variable to feed into the function. First, I need to fix the street name variable so that it has the appropriate suffixes after any numbers (ex: 144th St instead of 144 St).

# Isolate rows that need to be geocoded
to_geocode <- pollsites %>% filter(is.na(Latitude) & is.na(Longitude))

# Fix street name variable
to_geocode2 <- to_geocode %>% mutate(street_new = str_replace_all(
    STREET_NAME,
    "(\\b\\d+\\b)(?!(st|nd|rd|th))", # use regular expressions to identify numbers in the street name variable that do not already have a suffix
    function(x) {
      num <- as.integer(x) # extract the number from these rows
      suffix <- if (num %% 100 >= 11 && num %% 100 <= 13) { # address outliers 11 and 13 which end in "th" instead of "st" or "rd"
        "th"
      } else {
        switch(as.character(num %% 10), # extract the last digit of the number to assign it the proper suffix
               "1" = "st",
               "2" = "nd",
               "3" = "rd",
               "th")
      }
      paste0(num, suffix) # add the suffix to the number
    }
  ))

# Check that the new street name variable looks correct
to_geocode2 %>% 
  filter(str_detect(street_new, "\\d")) %>% # only look at rows that have a number in the street name
  count(street_new) # print all instances of this
## # A tibble: 46 × 2
##    street_new       n
##    <chr>        <int>
##  1 101st Avenue     1
##  2 104th Street     1
##  3 13th Avenue      1
##  4 13th Street      1
##  5 164th Street     1
##  6 18th Avenue      1
##  7 2nd Avenue       1
##  8 44th Avenue      1
##  9 4th Avenue       2
## 10 51st Avenue      1
## # ℹ 36 more rows
# Create singular address variable
to_geocode3 <- to_geocode2 %>% mutate(address = paste0(STREET_NUMBER, " ", street_new, ", ", CITY, ", ", "NY ", POSTCODE))

# Check that address looks correct
to_geocode3 %>% select(STREET_NUMBER, street_new, CITY, POSTCODE, address)
## # A tibble: 124 × 5
##    STREET_NUMBER street_new                   CITY        POSTCODE address      
##    <chr>         <chr>                        <chr>       <chr>    <chr>        
##  1 430           Howard Avenue                Brooklyn    11233    430 Howard A…
##  2 2177          Frederick Douglass Boulevard New York    10026    2177 Frederi…
##  3 300           Avenue X                     Brooklyn    11223    300 Avenue X…
##  4 495           Gates Avenue                 Brooklyn    11216    495 Gates Av…
##  5 549           Audubon Avenue               New York    10040    549 Audubon …
##  6 127-15        Kew Gardens Road             Kew Gardens 11415    127-15 Kew G…
##  7 1716          Southern Boulevard           Bronx       10460    1716 Souther…
##  8 270           West 89th Street             New York    10024    270 West 89t…
##  9 137           Jamaica Avenue               Brooklyn    11207    137 Jamaica …
## 10 9941          Fort Hamilton Parkway        Brooklyn    11209    9941 Fort Ha…
## # ℹ 114 more rows
# Geocode
geocoded_addresses <- to_geocode3 %>% geocode(address)

# Check that all addresses were geocoded
geocoded_addresses %>% filter(is.na(lat) | is.na(long)) %>% select(address, lat, long)
## # A tibble: 3 × 3
##   address                                                    lat  long
##   <chr>                                                    <dbl> <dbl>
## 1 110-04 Atlantic Avenue, South Richmond Hill, NY 11419       NA    NA
## 2 105-25 Horace Harding Expressway North, Corona, NY 11368    NA    NA
## 3 71-50 Parsons Boulevard, Fresh Meadows, NY 11365            NA    NA

There are three addresses that didn’t get geocoded. I’m going to manually edit these addresses and try again

# Isolate rows that didn't get geocoded
not_geo <- geocoded_addresses %>% filter(is.na(lat) | is.na(long)) %>%
  select(-c(lat, long)) # drop lat and long variables to avoid duplication later

# Fix their addresses
not_geo <- not_geo %>% mutate(address = case_when(
  address == "110-04 Atlantic Avenue, South Richmond Hill, NY 11419" ~ "110-04 Atlantic Avenue, Richmond Hill, NY 11419",
  address == "105-25 Horace Harding Expressway North, Corona, NY 11368" ~ "105-25 Horace Harding Expy, Corona, NY 11368",
  address == "71-50 Parsons Boulevard, Fresh Meadows, NY 11365" ~ "71-50 Parsons Blvd, Flushing, NY 11365"
))

# Geocode
not_geo_geocoded <- not_geo %>% geocode(address)

# Check that all addresses were geocoded
not_geo_geocoded %>% filter(is.na(lat) | is.na(long)) %>% select(address, lat, long)
## # A tibble: 0 × 3
## # ℹ 3 variables: address <chr>, lat <dbl>, long <dbl>

Now all rows are geocoded! I’m going to stack the two sets of geocoded addresses and then add them back into the original dataset.

# Remove 3 non-geocoded rows from original geocoded address dataset
geocoded_addresses2 <- geocoded_addresses %>% filter(!is.na(lat) & !is.na(long))

# Check that only 3 rows are missing
isTRUE(geocoded_addresses2 %>% nrow() == geocoded_addresses %>% nrow() - 3)
## [1] TRUE
# Stack geocoded addresses
all_geocoded_addresses <- geocoded_addresses2 %>% rbind(not_geo_geocoded)

# Check row counts
isTRUE(all_geocoded_addresses %>% nrow() == geocoded_addresses2 %>% nrow() + 3)
## [1] TRUE
isTRUE(all_geocoded_addresses %>% nrow() == geocoded_addresses %>% nrow())
## [1] TRUE
# Make columns match original dataset
all_geocoded_addresses2 <- all_geocoded_addresses %>% mutate(Latitude = lat, Longitude = long) %>% select(-c(address, lat, long, street_new))

# Remove these rows from original dataset
pollsites2 <- pollsites %>% filter(!is.na(Latitude) & !is.na(Longitude))

# Check that exactly 124 rows are missing
isTRUE(pollsites2 %>% nrow() == pollsites %>% nrow() - 124)
## [1] TRUE
# Add geocoded rows back to original dataset
pollsites3 <- pollsites2 %>% rbind(all_geocoded_addresses2)

# Check row counts
isTRUE(pollsites3 %>% nrow() == pollsites2 %>% nrow() + 124)
## [1] TRUE
isTRUE(pollsites3 %>% nrow() == pollsites %>% nrow())
## [1] TRUE
# Check that all rows have a latitude and longitude
pollsites3 %>% filter(is.na(Latitude) | is.na(Longitude))
## # A tibble: 0 × 20
## # ℹ 20 variables: BOROUGH <chr>, SITE_STATUS <chr>, SITE_NAME <chr>,
## #   SITE_NUMBER <chr>, STREET_NUMBER <chr>, STREET_SUFFIX <lgl>,
## #   STREET_NAME <chr>, POSTCODE <chr>, CITY <chr>, VOTER_ENTRANCE <chr>,
## #   HANDICAP_ENTRANCE <chr>, Latitude <dbl>, Longitude <dbl>,
## #   Community Board <dbl>, Council District <dbl>, Census Tract <dbl>,
## #   BIN <dbl>, BBL <dbl>, NTA <chr>, Location <chr>

Prepare demographic census data

I also need to prepare the demographic census data I loaded for a merge later on.

# Income to poverty ratio
nyc_income <- nyc_demo %>% filter(variable=="B17026_001") %>% mutate(Variable = "Income to poverty ratio") %>% rename(`Income to poverty ratio` = estimate)

# Check that only one variable remains
nyc_income %>% count(variable, Variable)
## # A tibble: 1 × 3
##   variable   Variable                    n
##   <chr>      <chr>                   <int>
## 1 B17026_001 Income to poverty ratio  2327
# Race
nyc_race <- nyc_demo %>% filter(variable!="B17026_001") %>% 
  mutate(Race = case_when( # Create labels for race variables
    variable=="B03001_003" ~ "Hispanic or Latino of any race",
    variable=="B03002_003" ~ "White alone, not Hispanic or Latino",
    variable=="B03002_004" ~ "Black or African American alone, not Hispanic or Latino",
    variable=="B03002_005" ~ "Native American alone, not Hispanic or Latino",
    variable=="B03002_006" ~ "Asian alone, not Hispanic or Latino",
    variable=="B03002_007" ~ "Native Hawaiian or Pacific Islander alone, not Hispanic or Latino",
    variable=="B03002_008" ~ "Other race alone, not Hispanic or Latino",
    variable=="B03002_009" ~ "Two or more races, not Hispanic or Latino",
    )) %>%
  group_by(GEOID) %>% mutate(Total_pop = sum(estimate)) %>% # create a total population for each census tract
  ungroup() %>% mutate(Pct_pop = estimate/Total_pop*100) # create percentages of each race for each census tract

# Check creation of new race variable
nyc_race %>% count(Race, variable)
## # A tibble: 8 × 3
##   Race                                                            variable     n
##   <chr>                                                           <chr>    <int>
## 1 Asian alone, not Hispanic or Latino                             B03002_…  2327
## 2 Black or African American alone, not Hispanic or Latino         B03002_…  2327
## 3 Hispanic or Latino of any race                                  B03001_…  2327
## 4 Native American alone, not Hispanic or Latino                   B03002_…  2327
## 5 Native Hawaiian or Pacific Islander alone, not Hispanic or Lat… B03002_…  2327
## 6 Other race alone, not Hispanic or Latino                        B03002_…  2327
## 7 Two or more races, not Hispanic or Latino                       B03002_…  2327
## 8 White alone, not Hispanic or Latino                             B03002_…  2327
# Pivot dataset
nyc_race2 <- nyc_race %>% select(-c(variable, moe)) %>% pivot_wider(names_from = Race, values_from = c(estimate, Pct_pop))

# Check creation of total population and percent variables
nyc_race2 %>% filter(Total_pop!=`estimate_Hispanic or Latino of any race`+`estimate_White alone, not Hispanic or Latino`+`estimate_Black or African American alone, not Hispanic or Latino`+`estimate_Native American alone, not Hispanic or Latino`+`estimate_Asian alone, not Hispanic or Latino`+`estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`+`estimate_Other race alone, not Hispanic or Latino`+`estimate_Two or more races, not Hispanic or Latino`)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Hispanic or Latino of any race`!= `estimate_Hispanic or Latino of any race`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_White alone, not Hispanic or Latino` != `estimate_White alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Black or African American alone, not Hispanic or Latino` != `estimate_Black or African American alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Native American alone, not Hispanic or Latino` != `estimate_Native American alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Asian alone, not Hispanic or Latino` != `estimate_Asian alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` != `estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Other race alone, not Hispanic or Latino` != `estimate_Other race alone, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …
nyc_race2 %>% filter(`Pct_pop_Two or more races, not Hispanic or Latino` != `estimate_Two or more races, not Hispanic or Latino`/Total_pop*100)
## # A tibble: 0 × 19
## # ℹ 19 variables: GEOID <chr>, NAME <chr>, Total_pop <dbl>,
## #   estimate_Hispanic or Latino of any race <dbl>,
## #   estimate_White alone, not Hispanic or Latino <dbl>,
## #   estimate_Black or African American alone, not Hispanic or Latino <dbl>,
## #   estimate_Native American alone, not Hispanic or Latino <dbl>,
## #   estimate_Asian alone, not Hispanic or Latino <dbl>,
## #   estimate_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino <dbl>, …

Merge

I will now merge the poll site data to the census population data using spatial geometry.

# Convert coordinates to a spatial object
pollsites_sf <- st_as_sf(pollsites3, coords = c("Longitude", "Latitude"), crs = st_crs(nyc_pop))

# Create "in" variables to check merge later on
pollsites_sf2 <- pollsites_sf %>% mutate(inPOLL = 1)
nyc_pop2 <- nyc_pop %>% mutate(inCENSUS = 1)

# Spatially join the two datasets
merged_df <- st_join(nyc_pop2, pollsites_sf2, left = TRUE)

# Check merge
merged_df %>% count(inPOLL, inCENSUS)
## Simple feature collection with 2 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS:  NAD83
##   inPOLL inCENSUS    n                       geometry
## 1      1        1 1231 MULTIPOLYGON (((-73.81123 4...
## 2     NA        1 1345 MULTIPOLYGON (((-73.9432 40...

All rows were merged successfully.

Now, I will merge this data to the demographic datasets I created earlier to use later on.

# Create "in" variables to check merge later on
nyc_race3 <- nyc_race2 %>% mutate(inRACE = 1)
nyc_income2 <- nyc_income %>% mutate(inINCOME = 1)
merged_df2 <- merged_df %>% mutate(inMERGE = 1)

# Merge
merged_df3 <- merged_df2 %>% full_join(nyc_race3, by = "GEOID")

# Check first merge
merged_df3 %>% count(inMERGE, inRACE) 
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS:  NAD83
##   inMERGE inRACE    n                       geometry
## 1       1      1 2576 MULTIPOLYGON (((-74.21211 4...
# Merge again
merged_df4 <- merged_df3 %>% full_join(nyc_income2, by = "GEOID")

# Check final merge
merged_df4 %>% count(inMERGE, inRACE, inINCOME)
## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25563 ymin: 40.4961 xmax: -73.70036 ymax: 40.91771
## Geodetic CRS:  NAD83
##   inMERGE inRACE inINCOME    n                       geometry
## 1       1      1        1 2576 MULTIPOLYGON (((-74.21211 4...

Now that the data is all cleaned and merged, I’m going to visualize the poll site locations relative to population density.

Visualize

I will visualize the poll site locations relative to population density one borough at a time.

Manhattan

tmap_mode("view")

manhattan_map <- tm_shape(nyc_pop %>% filter(grepl("New York County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="MANHATTAN")) + tm_dots(fill = "navy")

manhattan_map

Brooklyn

brooklyn_map <- tm_shape(nyc_pop %>% filter(grepl("Kings County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="BROOKLYN")) + tm_dots(fill = "navy")

brooklyn_map

Queens

queens_map <- tm_shape(nyc_pop %>% filter(grepl("Queens County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="QUEENS")) + tm_dots(fill = "navy")

queens_map

Bronx

bronx_map <- tm_shape(nyc_pop %>% filter(grepl("Bronx County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="BRONX")) + tm_dots(fill = "navy")

bronx_map

Staten Island

si_map <- tm_shape(nyc_pop %>% filter(grepl("Richmond County", NAME))) + tm_fill(fill = "estimate", fill.scale = tm_scale(breaks = c(0, 1000, 2000, 3000, 5000, 10000, 15000, 20000)), fill_alpha = 0.7) + tm_shape(pollsites_sf %>% filter(BOROUGH=="STATEN IS")) + tm_dots(fill = "navy")

si_map

It looks like poll sites are pretty evenly distributed in terms of population density, but I’m now going to model the data to make sure that is actually true.

Linear model

I am going to be running a linear model on the merged data to test my hypothesis that population is a predictor of the number of poll sites in a census tract.

First, I need to prepare the data for the model by identifying how many poll sites are in each census tract.

# Create new dataset with number of poll sites by census tract
census_tract <- merged_df4 %>% group_by(GEOID) %>% mutate(n_pollsite = sum(!is.na(SITE_NAME)))

# Check that there are 2327 unique census tracts
census_tract %>% distinct(GEOID) %>% nrow()
## [1] 2327
# Check that poll site numbers are within reason
max(census_tract$n_pollsite)
## [1] 5
min(census_tract$n_pollsite)
## [1] 0
# Collapse to 1 row per census tract
census_tract2 <- census_tract %>% select(estimate, GEOID, n_pollsite, `Income to poverty ratio`, `Pct_pop_Hispanic or Latino of any race`, `Pct_pop_White alone, not Hispanic or Latino`, `Pct_pop_Black or African American alone, not Hispanic or Latino`, `Pct_pop_Native American alone, not Hispanic or Latino`, `Pct_pop_Asian alone, not Hispanic or Latino`, `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`, `Pct_pop_Other race alone, not Hispanic or Latino`, `Pct_pop_Two or more races, not Hispanic or Latino`) %>% distinct()
census_tract2 %>% nrow() # should be 2327
## [1] 2327

Next, I will run the model.

population_model <- lm(n_pollsite ~ estimate, data = census_tract2)

population_model %>% summary()
## 
## Call:
## lm(formula = n_pollsite ~ estimate, data = census_tract2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7365 -0.4799 -0.2718  0.4444  4.2715 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.010e-01  2.885e-02   3.501 0.000472 ***
## estimate    1.169e-04  6.899e-06  16.952  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6737 on 2325 degrees of freedom
## Multiple R-squared:   0.11,  Adjusted R-squared:  0.1096 
## F-statistic: 287.4 on 1 and 2325 DF,  p-value: < 2.2e-16

Population is a statistically significant predictor of the number of poll sites in a neighborhood. I am also going to add a few other demographic variables to the model to see if there are any confounding variables, or any other predictors of the number of poll sites.

population_demo_model <- lm(n_pollsite ~ estimate + `Income to poverty ratio` + `Pct_pop_Hispanic or Latino of any race` + `Pct_pop_White alone, not Hispanic or Latino` + `Pct_pop_Black or African American alone, not Hispanic or Latino` + `Pct_pop_Native American alone, not Hispanic or Latino` + `Pct_pop_Asian alone, not Hispanic or Latino` + `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` + `Pct_pop_Other race alone, not Hispanic or Latino` + `Pct_pop_Two or more races, not Hispanic or Latino`, data = census_tract2)

population_demo_model %>% summary()
## 
## Call:
## lm(formula = n_pollsite ~ estimate + `Income to poverty ratio` + 
##     `Pct_pop_Hispanic or Latino of any race` + `Pct_pop_White alone, not Hispanic or Latino` + 
##     `Pct_pop_Black or African American alone, not Hispanic or Latino` + 
##     `Pct_pop_Native American alone, not Hispanic or Latino` + 
##     `Pct_pop_Asian alone, not Hispanic or Latino` + `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino` + 
##     `Pct_pop_Other race alone, not Hispanic or Latino` + `Pct_pop_Two or more races, not Hispanic or Latino`, 
##     data = census_tract2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7323 -0.4892 -0.2842  0.4514  4.2325 
## 
## Coefficients: (1 not defined because of singularities)
##                                                                               Estimate
## (Intercept)                                                                 -2.829e-02
## estimate                                                                     1.179e-04
## `Income to poverty ratio`                                                   -1.700e-05
## `Pct_pop_Hispanic or Latino of any race`                                     1.661e-03
## `Pct_pop_White alone, not Hispanic or Latino`                                1.948e-03
## `Pct_pop_Black or African American alone, not Hispanic or Latino`            1.844e-03
## `Pct_pop_Native American alone, not Hispanic or Latino`                     -4.915e-03
## `Pct_pop_Asian alone, not Hispanic or Latino`                                5.772e-05
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`  5.691e-03
## `Pct_pop_Other race alone, not Hispanic or Latino`                          -3.026e-03
## `Pct_pop_Two or more races, not Hispanic or Latino`                                 NA
##                                                                             Std. Error
## (Intercept)                                                                  4.239e-01
## estimate                                                                     2.005e-05
## `Income to poverty ratio`                                                    8.654e-05
## `Pct_pop_Hispanic or Latino of any race`                                     4.283e-03
## `Pct_pop_White alone, not Hispanic or Latino`                                4.417e-03
## `Pct_pop_Black or African American alone, not Hispanic or Latino`            4.498e-03
## `Pct_pop_Native American alone, not Hispanic or Latino`                      1.706e-02
## `Pct_pop_Asian alone, not Hispanic or Latino`                                4.450e-03
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`  1.880e-02
## `Pct_pop_Other race alone, not Hispanic or Latino`                           6.999e-03
## `Pct_pop_Two or more races, not Hispanic or Latino`                                 NA
##                                                                             t value
## (Intercept)                                                                  -0.067
## estimate                                                                      5.881
## `Income to poverty ratio`                                                    -0.196
## `Pct_pop_Hispanic or Latino of any race`                                      0.388
## `Pct_pop_White alone, not Hispanic or Latino`                                 0.441
## `Pct_pop_Black or African American alone, not Hispanic or Latino`             0.410
## `Pct_pop_Native American alone, not Hispanic or Latino`                      -0.288
## `Pct_pop_Asian alone, not Hispanic or Latino`                                 0.013
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`   0.303
## `Pct_pop_Other race alone, not Hispanic or Latino`                           -0.432
## `Pct_pop_Two or more races, not Hispanic or Latino`                              NA
##                                                                             Pr(>|t|)
## (Intercept)                                                                    0.947
## estimate                                                                    4.69e-09
## `Income to poverty ratio`                                                      0.844
## `Pct_pop_Hispanic or Latino of any race`                                       0.698
## `Pct_pop_White alone, not Hispanic or Latino`                                  0.659
## `Pct_pop_Black or African American alone, not Hispanic or Latino`              0.682
## `Pct_pop_Native American alone, not Hispanic or Latino`                        0.773
## `Pct_pop_Asian alone, not Hispanic or Latino`                                  0.990
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`    0.762
## `Pct_pop_Other race alone, not Hispanic or Latino`                             0.666
## `Pct_pop_Two or more races, not Hispanic or Latino`                               NA
##                                                                                
## (Intercept)                                                                    
## estimate                                                                    ***
## `Income to poverty ratio`                                                      
## `Pct_pop_Hispanic or Latino of any race`                                       
## `Pct_pop_White alone, not Hispanic or Latino`                                  
## `Pct_pop_Black or African American alone, not Hispanic or Latino`              
## `Pct_pop_Native American alone, not Hispanic or Latino`                        
## `Pct_pop_Asian alone, not Hispanic or Latino`                                  
## `Pct_pop_Native Hawaiian or Pacific Islander alone, not Hispanic or Latino`    
## `Pct_pop_Other race alone, not Hispanic or Latino`                             
## `Pct_pop_Two or more races, not Hispanic or Latino`                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.684 on 2233 degrees of freedom
##   (84 observations deleted due to missingness)
## Multiple R-squared:  0.09941,    Adjusted R-squared:  0.09578 
## F-statistic: 27.39 on 9 and 2233 DF,  p-value: < 2.2e-16

Conclusion

As shown in the last linear model, the only variable that is a statistically significant predictor of the number of voting poll sites in a census tract is its population. The distribution of race and the income to poverty ratio of census tracts all do not appear to affect the number of poll sites found in that neighborhood.